LASSO-TYPE RECOVERY OF SPARSE REPRESENTATIONS
FOR HIGH-DIMENSIONAL DATA 

By Nicolai Meinshausen 1 and Bin Yu 2 

University of Oxford and University of California, Berkeley 

The Lasso is an attractive technique for regularization and variable
selection for high-dimensional data, where the number of predictor
variables $p_n$ is potentially much larger than the number of samples $n$.
However, it was recently discovered that the sparsity pattern of the Lasso
estimator can only be asymptotically identical to the true sparsity pattern
if the design matrix satisfies the so-called irrepresentable condition. The
latter condition can easily be violated in the presence of highly
correlated variables.

Here we examine the behavior of the Lasso estimators if the
irrepresentable condition is relaxed. Even though the Lasso cannot recover
the correct sparsity pattern, we show that the estimator is still
consistent in the $\ell_2$-norm sense for fixed designs under conditions on
(a) the number $s_n$ of nonzero components of the vector $\beta_n$ and (b)
the minimal singular values of design matrices that are induced by
selecting small subsets of variables. Furthermore, a rate of convergence
result is obtained on the $\ell_2$ error with an appropriate choice of the
smoothing parameter. The rate is shown to be optimal under the condition of
bounded maximal and minimal sparse eigenvalues. Our results imply that,
with high probability, all important variables are selected. The set of
selected variables is a meaningful reduction on the original set of
variables. Finally, our results are illustrated with the detection of
closely adjacent frequencies, a problem encountered in astrophysics.

1. Introduction. The Lasso was introduced by [29] and has since proved
very popular and been well studied [18, 35, 41, 42]. Some reasons for the
popularity might be that the entire regularization path of the Lasso can be
computed efficiently [11, 25], that the Lasso is able to handle more
predictor variables than samples, and that it produces sparse models which
are easy to interpret. Several extensions and variations have been proposed
[5, 21, 36, 40, 42].

1.1. Lasso-type estimation. The Lasso estimator, as introduced by [29],
is given by

(1)    $\hat\beta^\lambda = \arg\min_{\beta} \|Y - X\beta\|_{\ell_2}^2 + \lambda \|\beta\|_{\ell_1},$

where $X = (X_1, \ldots, X_p)$ is the $n \times p$ matrix whose columns
consist of the $n$-dimensional fixed predictor variables $X_k$,
$k = 1, \ldots, p$. The vector $Y$ contains the $n$ real-valued
observations of the response variable.
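As an illustration of the objective in (1), the following is a minimal sketch, not the computational method used in the paper. It assumes NumPy and uses plain proximal-gradient (ISTA) iterations; the function names `lasso_objective`, `soft_threshold` and `lasso_ista` are hypothetical. Note that the objective carries no factor 1/2 or 1/n, so the soft-threshold level is the step size times $\lambda$.

```python
import numpy as np


def lasso_objective(X, y, beta, lam):
    """Objective ||y - X beta||_2^2 + lam * ||beta||_1 (no 1/2 or 1/n factor)."""
    resid = y - X @ beta
    return float(resid @ resid + lam * np.abs(beta).sum())


def soft_threshold(z, t):
    """Componentwise soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_ista(X, y, lam, n_iter=500):
    """Proximal-gradient (ISTA) iterations for the objective above.

    Illustrative sketch only; path algorithms such as those cited in the
    text are a different technique.
    """
    # The gradient of the squared loss is 2 X^T (X beta - y); its Lipschitz
    # constant is 2 * (largest singular value of X)^2.
    lipschitz = 2.0 * np.linalg.norm(X, 2) ** 2
    step = 1.0 / lipschitz
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta
```

For an orthonormal design (for instance the identity matrix), the minimizer of this objective is the soft-thresholded response at level $\lambda/2$, which gives a quick sanity check of the sketch.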

The distribution of Lasso-type estimators has been studied in Knight and
Fu [18]. Variable selection and prediction properties of the Lasso have
been studied extensively for high-dimensional data with $p_n \gg n$, a
frequently encountered challenge in modern statistical applications. Some
studies, for example, Bunea, Tsybakov and Wegkamp [2], Greenshtein and
Ritov [13] and van de Geer [34], have focused mainly on the behavior of
prediction loss. Much recent work aims at understanding the Lasso estimates
from the point of view of model selection, including Candes and Tao [5],
Donoho, Elad and Temlyakov [10], Meinshausen and Bühlmann [23], Tropp [30],
Wainwright [35], Zhao and Yu [41] and Zou [42]. For the Lasso estimates to
be close to the model selection estimates when the data dimensions grow,
all the aforementioned papers assumed a sparse model and used various
conditions that require the irrelevant variables to be not too correlated
with the relevant ones. Incoherence is the terminology used in the
deterministic setting of Donoho, Elad and Temlyakov [10], and
"irrepresentability" is used in the stochastic setting (linear model) of
Zhao and Yu [41]. Here we focus exclusively on the properties of the
estimate of the coefficient vector under squared error loss and try to
understand the behavior of the estimate under a relaxed irrepresentable
condition (hence we are in the stochastic or linear model setting). The aim
is to see whether the Lasso still gives meaningful models in this case.

Connections with other work are discussed further in Section 1.5, after
the notions needed to state the irrepresentable condition explicitly have
been introduced.

1.2. Linear regression model. We assume a linear model for the
observations of the response variable $Y = (Y_1, \ldots, Y_n)^T$,

(2)    $Y = X\beta + \varepsilon,$

where $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ is a vector
containing independently and identically distributed noise with
$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ for all $i = 1, \ldots, n$.
The assumption of Gaussianity could be relaxed and replaced with
exponential tail bounds on the noise if, additionally, predictor variables
are assumed to be bounded. When there is a question of nonidentifiability
for $\beta$, for $p_n > n$, we define $\beta$ as

(3)    $\beta = \arg\min_{\{\beta : EY = X\beta\}} \|\beta\|_{\ell_1}.$

The aim is to recover the vector $\beta$ as well as possible from noisy
observations $Y$. For the equivalence between $\ell_1$- and
$\ell_0$-sparse solutions see, for example, Donoho [8], Donoho and Elad
[9], Fuchs [12], Gribonval and Nielsen [14], Tropp [30, 31].

1.3. Recovery of the sparsity pattern and the irrepresentable condition.
There is empirical evidence that many signals in high-dimensional spaces
allow for a sparse representation. As an example, wavelet coefficients of
images often exhibit exponential decay, and a relatively small subset of
all wavelet coefficients allows a good approximation to the original image
[17, 19, 20]. For conceptual simplicity, we assume in our regression
setting that the vector $\beta$ is sparse in the $\ell_0$-sense and many
coefficients of $\beta$ are identically zero. The corresponding variables
have thus no influence on the response variable and could be safely
removed. The sparsity pattern of $\beta$ is understood to be the sign
function of its entries, with sign$(x) = 0$ if $x = 0$, sign$(x) = 1$ if
$x > 0$ and sign$(x) = -1$ if $x < 0$. The sparsity pattern of a vector
might thus look like

    sign$(\beta) = (+1, -1, 0, 0, +1, +1, -1, +1, 0, 0, \ldots)$,

distinguishing whether variables have a positive, negative or no influence
at all on the response variable. It is of interest whether the sparsity
pattern of the Lasso estimator is a good approximation to the true sparsity
pattern. If these sparsity patterns agree asymptotically, the estimator is
said to be sign consistent [41].

Definition 1 (Sign consistency). An estimator $\hat\beta^\lambda$ is sign
consistent if and only if

    $P\{\mathrm{sign}(\beta) = \mathrm{sign}(\hat\beta^\lambda)\} \to 1$  as $n \to \infty$.
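The event inside the probability of Definition 1 can be checked for a single estimate as in the sketch below, which assumes NumPy. The helper names `sign_pattern` and `same_sign_pattern` and the optional tolerance `tol` (treating tiny numerical values as zero) are illustrative assumptions and not part of the definition.

```python
import numpy as np


def sign_pattern(beta, tol=0.0):
    """Sparsity pattern: +1, -1 or 0 per entry; |beta_k| <= tol counts as 0."""
    beta = np.asarray(beta, dtype=float)
    pattern = np.sign(beta)
    pattern[np.abs(beta) <= tol] = 0.0
    return pattern.astype(int)


def same_sign_pattern(beta_hat, beta, tol=0.0):
    """True if the estimate reproduces the sign pattern of the true vector."""
    return bool(np.array_equal(sign_pattern(beta_hat, tol), sign_pattern(beta)))
```

Sign consistency is a statement about the probability of this event as $n$ grows, so in a simulation one would average `same_sign_pattern` over repeated noise draws.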

It was shown independently in Zhao and Yu [41] and Zou [42] in the linear
model case and in Meinshausen and Bühlmann [23] in a Gaussian graphical
model setting that sign consistency requires a condition on the design
matrix. The assumption was termed neighborhood stability in Meinshausen and
Bühlmann [23] and irrepresentable condition in Zhao and Yu [41]. Let
$C = n^{-1} X^T X$. The dependence on $n$ is neglected notationally.
Definition 2 (Irrepresentable condition). Let $K = \{k : \beta_k \neq 0\}$
be the set of relevant variables and let $N = \{1, \ldots, p\} \setminus K$
be the set of noise variables. The sub-matrix $C_{HK}$ is understood as the
matrix obtained from $C$ by keeping rows with index in the set $H$ and
columns with index in $K$. The irrepresentable condition is fulfilled if

    $\|C_{NK} C_{KK}^{-1} \mathrm{sign}(\beta_K)\|_{\ell_\infty} < 1.$

In Zhao and Yu [41], an additional strong irrepresentable condition is
defined which requires that the above elements are not merely smaller than
1 but are uniformly bounded away from 1. Zhao and Yu [41], Zou [42] and
Meinshausen and Bühlmann [23] show that the Lasso is sign consistent only
if the irrepresentable condition holds.
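For a given Gram matrix $C = n^{-1} X^T X$ and coefficient vector $\beta$, the quantity that the irrepresentable condition compares with 1 can be evaluated numerically. The following is a minimal sketch assuming NumPy; the function name `irrepresentable_value` is hypothetical, and the sketch assumes $C_{KK}$ is invertible.

```python
import numpy as np


def irrepresentable_value(C, beta):
    """Return || C_NK C_KK^{-1} sign(beta_K) ||_inf for a Gram matrix C.

    The irrepresentable condition holds if the returned value is < 1.
    Assumes C_KK is invertible (illustrative sketch only).
    """
    C = np.asarray(C, dtype=float)
    beta = np.asarray(beta, dtype=float)
    K = np.flatnonzero(beta != 0)   # relevant variables
    N = np.flatnonzero(beta == 0)   # noise variables
    if K.size == 0 or N.size == 0:
        return 0.0
    C_KK = C[np.ix_(K, K)]
    C_NK = C[np.ix_(N, K)]
    w = np.linalg.solve(C_KK, np.sign(beta[K]))
    return float(np.max(np.abs(C_NK @ w)))
```

As a small example, with two uncorrelated relevant variables and one noise variable that has correlation 0.6 with each of them, the value is 1.2, so the condition is violated even though the Gram matrix is positive definite.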

Proposition 1 (Sign consistency). Assume that the irrepresentable
condition or neighborhood stability is not fulfilled. Then there exists no
sequence $\lambda = \lambda_n$ such that the estimator $\hat\beta^\lambda$
is sign consistent.

It is worth noting that a slightly stronger condition has been used in
Tropp [30, 31] in a deterministic study of Lasso's model selection
properties, where $1 - \|C_{NK} C_{KK}^{-1}\|_{\infty}$ (with the matrix
norm induced by the $\ell_\infty$-norm) is called ERC (exact recovery
coefficient). A positive ERC implies the irrepresentable condition for all
$\beta$ values.

In practice, it might be difficult to verify whether the condition is
fulfilled. This led various authors to propose interesting extensions to
the Lasso [22, 39, 42]. Before giving up on the Lasso altogether, however,
we want to examine in this paper in what sense the original Lasso procedure
still gives sensible results, even if the irrepresentable condition or,
equivalently, neighborhood stability is not fulfilled.

1.4. $\ell_2$-consistency. The aforementioned studies showed that if the
irrepresentable condition is not fulfilled, the Lasso cannot select the
correct sparsity pattern. In this paper we show that the Lasso selects in
these cases the nonzero entries of $\beta$ and some not-too-many additional
zero entries of $\beta$ under conditions weaker than the irrepresentable
condition. The nonzero entries of $\beta$ are in any case included in the
selected model. Moreover, the size of the estimated coefficients allows one
to separate the few truly zero and the many nonzero coefficients. However,
we note that in extreme cases, when the variables are linearly dependent,
even these relaxed conditions will be violated. In these situations, it is
not sensible to use the $\ell_2$-metric on $\beta$ to assess the Lasso.

Our main result shows the $\ell_2$-consistency of the Lasso, even if the
irrepresentable condition is violated. To be precise, an estimator
$\hat\beta$ is said to be $\ell_2$-consistent if

(4)    $\|\hat\beta - \beta\|_{\ell_2} \to 0$  as $n \to \infty$.

Rate-of-convergence results will also be derived; under the condition of
bounded maximal and minimal sparse eigenvalues, the rate is seen to be
optimal. An $\ell_2$-consistent estimator is attractive, as important
variables are chosen with high probability and falsely chosen variables
have very small coefficients. The bottom line will be that even if the
sparsity pattern of $\beta$ cannot be recovered by the Lasso, we can still
obtain a good approximation.

1.5. Related work. Prediction loss for high-dimensional regression under
an $\ell_1$-penalty has been studied for quadratic loss functions in
Greenshtein and Ritov [13] and for general Lipschitz loss functions in van
de Geer [34]. With a focus on aggregation, similarly interesting results
are derived in Bunea, Tsybakov and Wegkamp [3]. Both van de Geer [34] and
Bunea, Tsybakov and Wegkamp [3] obtain impressive results for random design
and sharp bounds for the $\ell_1$-distance between the vector $\beta$ and
its Lasso estimate $\hat\beta^\lambda$. In the current manuscript, we focus
on the $\ell_2$-estimation loss on $\beta$. As a consequence, we can derive
consistency in the sense of (4) under the condition that
$s_n \log p_n / n \to 0$ for $n \to \infty$ (ignoring $\log n$ factors). An
implication of our work is thus that the sparsity $s_n$ is allowed to grow
almost as fast as the sample size if one is interested in obtaining
convergence in $\ell_2$-norm. In contrast, the results in [3, 34] require
$s_n = o(\sqrt{n})$ to obtain convergence in $\ell_1$-norm.

The recent independent work of Zhang and Huang [38] shows that the
subspace spanned by the variables selected by the Lasso is close to an
optimal subspace. The results also imply that important variables are
chosen with high probability and provide a tight bound on the
$\ell_2$-distance between the vector $\beta$ and its Lasso estimator. A
"partial Riesz condition" is employed in [38], which is rather similar to
our notion of incoherent design, defined further below in (6).

We would like to compare the results of this manuscript briefly with
results in Donoho [8] and Candes and Tao [5], as both of these papers
derive bounds on the $\ell_2$-norm distance between $\beta$ and
$\hat\beta$ for $\ell_1$-norm constrained estimators. In Donoho [8] the
design is random and the random predictor variables are assumed to be
independent. The results are thus not directly comparable to the results
derived here for general fixed designs. Nevertheless, results in
Meinshausen and Bühlmann [23] suggest that the irrepresentable condition is
with high probability fulfilled for independently normal distributed
predictor variables. The results in Donoho [8] can thus not directly be
used to study the behavior of the Lasso under a violated irrepresentable
condition, which is our goal in the current manuscript.

Candes and Tao [5] study the properties of the so-called "Dantzig
selector," which is very similar to the Lasso, and derive bounds on the
$\ell_2$-distance between the vector $\beta$ and the proposed estimator
$\hat\beta$. The results are derived under the condition of a Uniform
Uncertainty Principle (UUP), which was introduced in Candes and Tao [4].
The UUP is related to our assumptions on sparse eigenvalues in this
manuscript. A comparison between these two assumptions is given after the
formulation (10) of the UUP. The bounds on the $\ell_2$-distance between
the true coefficient vector $\beta$ and its Lasso estimator (obtained in
the current manuscript) or, respectively, "Dantzig selector" (obtained in
[5]) are quite similar in nature. This comes maybe as no surprise since the
formulation of the "Dantzig selector" is quite similar to the Lasso [24].
However, it does not seem straightforward to translate the bounds obtained
for the "Dantzig selector" into bounds for the Lasso estimator and vice
versa. We also employ somewhat different conditions because there could be
design matrices arising in statistical practice where the dependence
between the predictors is stronger than what is allowed by the UUP, but
which would satisfy our condition of "incoherent design" to be defined in
the next section. It would certainly be of interest to study the connection
between the Lasso and "Dantzig selector" further, as the solutions share
many similarities.

Final note: a recent follow-up work [1] provides similar bounds as in this 
paper for both Lasso and Dantzig selectors. 

2. Main assumptions and results. First, we introduce the notion of sparse 
eigenvalues, which will play a crucial role in providing bounds for the conver- 
gence rates of the Lasso estimator. Thereafter, the assumptions are explained 
in detail and the main results are given. 

2.1. Sparse eigenvalues. The notion of sparse eigenvalues is not new and
has been used before [8]; we merely intend to fix notation. The $m$-sparse
minimal eigenvalue of a matrix is the minimal eigenvalue of any
$m \times m$-dimensional submatrix.

Definition 3. The $m$-sparse minimal eigenvalue and $m$-sparse maximal
eigenvalue of $C$ are defined as

(5)    $\phi_{\min}(m) = \min_{\beta : \|\beta\|_{\ell_0} \le \lceil m \rceil} \frac{\beta^T C \beta}{\beta^T \beta}$  and  $\phi_{\max}(m) = \max_{\beta : \|\beta\|_{\ell_0} \le \lceil m \rceil} \frac{\beta^T C \beta}{\beta^T \beta}.$

The minimal eigenvalue of the unrestricted matrix $C$ is equivalent to
$\phi_{\min}(p)$. If the number of predictor variables $p_n$ is larger than
the sample size, $p_n > n$, this eigenvalue is zero, as
$\phi_{\min}(m) = 0$ for any $m > n$.

A crucial factor contributing to the convergence of the Lasso estimator
is the behavior of the smallest $m$-sparse eigenvalue, where the number $m$
of variables over which the minimal eigenvalue is computed is roughly of
the same order as the sparsity $s_n$, or the number of nonzero components,
of the true underlying vector $\beta$.

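For small problems the sparse eigenvalues of Definition 3 can be computed by brute force, which may help build intuition. The sketch below assumes NumPy and enumerates all principal submatrices of size $\lceil m \rceil$; by eigenvalue interlacing, subsets of exactly that size already attain the minimum and maximum over all smaller supports. The function name `sparse_eigenvalues` is hypothetical, and the enumeration is exponential in $p$, so this is only feasible for toy examples.

```python
from itertools import combinations

import numpy as np


def sparse_eigenvalues(C, m):
    """Brute-force m-sparse minimal and maximal eigenvalues of a symmetric C.

    Enumerates all principal submatrices of size ceil(m) (capped at p).
    Illustrative only: the cost grows combinatorially with p.
    """
    C = np.asarray(C, dtype=float)
    p = C.shape[0]
    size = min(int(np.ceil(m)), p)
    phi_min, phi_max = np.inf, -np.inf
    for subset in combinations(range(p), size):
        idx = np.array(subset)
        eig = np.linalg.eigvalsh(C[np.ix_(idx, idx)])  # ascending order
        phi_min = min(phi_min, eig[0])
        phi_max = max(phi_max, eig[-1])
    return float(phi_min), float(phi_max)
```

With two variables of correlation 0.9 and a third uncorrelated one, the 1-sparse eigenvalues are both 1, while the 2-sparse eigenvalues are 0.1 and 1.9, showing how correlation drives the minimal sparse eigenvalue toward zero.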
2.2. Sparsity multipliers and incoherent designs. As apparent from the
interesting discussion in Candes and Tao [5], one cannot allow arbitrarily
large "coherence" between variables if one still hopes to recover the
correct sparsity pattern. Assume that there are two vectors $\beta$ and
$\tilde\beta$ so that the signal can be represented by either vector,
$X\beta = X\tilde\beta$, and both vectors are equally sparse, say
$\|\beta\|_{\ell_0} = \|\tilde\beta\|_{\ell_0} = s_n$, and are not
identical. We have no hope of distinguishing between $\beta$ and
$\tilde\beta$ in such a case: if indeed $X\beta = X\tilde\beta$ and $\beta$
and $\tilde\beta$ are not identical, it follows that the minimal sparse
eigenvalue $\phi_{\min}(2s_n) = 0$ vanishes, as
$X(\beta - \tilde\beta) = 0$ and
$\|\beta - \tilde\beta\|_{\ell_0} \le 2s_n$. If the minimal sparse
eigenvalue of a selection of $2s_n$ variables is zero, we have no hope of
recovering the true sparse underlying vector from noisy observations.

To define our assumption about sufficient conditions for recovery, we need
the definition of incoherent design. As motivated by the example above, we
would need a lower bound on the minimal eigenvalue of at least $2s_n$
variables, where $s_n$ is again the number of nonzero coefficients. We now
introduce the concepts of sparsity multiplier and incoherent design to make
this requirement a bit more general, as minimal eigenvalues are allowed to
converge to zero slowly.

A design is called incoherent in the following if minimal sparse
eigenvalues are not decaying too fast, in a sense made precise in the
definition below. For notational simplicity, let in the following

    $\phi_{\max} = \phi_{\max}(s_n + \min\{n, p_n\})$

be the maximal eigenvalue of a selection of at most $s_n + \min\{n, p_n\}$
variables. At the cost of more involved proofs, one could also work with
the maximal eigenvalue of a smaller selection of variables instead. Even
though we do not assume an upper bound for the quantity $\phi_{\max}$, it
would not be very restrictive to do so for the $p_n \gg n$ setting. To be
specific, assume multivariate normal predictors. If the maximal eigenvalue
of the population covariance matrix, which is induced by selecting $2n$
variables, is bounded from above by an arbitrarily large constant, it
follows by Theorem 2.13 in Davidson and Szarek [7] or Lemma A3.1 in Paul
[26] that the condition number of the induced sample covariance matrix
observes a Gaussian tail bound. Using an entropy bound for the possible
number of subsets when choosing $n$ out of $p_n$ variables, the maximal
eigenvalue of a selection of $2\min\{n, p\}$ variables is thus bounded from
above by some constant, with probability converging to 1 for
$n \to \infty$ under the condition that $\log p_n = o(n^\kappa)$ for some
$\kappa < 1$, and the assumption of a bounded $\phi_{\max}$, even though
not needed, is thus maybe not overly restrictive.

As the maximal sparse eigenvalue is typically growing only very slowly 
as a function of the number of variables, the focus will be on the decay of 
the smallest sparse eigenvalue, which is a much more pressing problem for 
high-dimensional data. 
Definition 4 (Incoherent designs). A design is called incoherent if there
exists a positive sequence $e_n$, the so-called sparsity multiplier
sequence, such that

(6)    $\liminf_{n \to \infty} \frac{e_n \, \phi_{\min}(e_n^2 s_n)}{\phi_{\max}(s_n + \min\{n, p_n\})} \ge 18.$

Our main result will require incoherent design. The constant 18 could
quite possibly be improved upon. We will assume for the following that the
multiplier sequence is the smallest one for which (6) holds. Below, we give
some simple examples under which the condition of incoherent design is
fulfilled.

2.2.1. Example: block designs. The first example is maybe not overly
realistic but gives, hopefully, some intuition for the condition. A "block
design" is understood to have the structure

(7)    $n^{-1} X^T X = \begin{pmatrix} \Sigma(1) & 0 & \cdots & 0 \\ 0 & \Sigma(2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma(d) \end{pmatrix},$

where the matrices $\Sigma(1), \ldots, \Sigma(d)$ are of dimension
$b(1), \ldots, b(d)$, respectively. The minimal and maximal eigenvalues
over all $d$ sub-matrices are denoted by

    $\phi^{\mathrm{block}}_{\min} := \min_k \min_{u \in \mathbb{R}^{b(k)}} \frac{u^T \Sigma(k) u}{u^T u}, \qquad \phi^{\mathrm{block}}_{\max} := \max_k \max_{u \in \mathbb{R}^{b(k)}} \frac{u^T \Sigma(k) u}{u^T u}.$

In our setup, all constants are allowed to depend on the sample size $n$.
The question arises if simple bounds can be found under which the design is
incoherent in the sense of (6). The blocked sparse eigenvalues are trivial
lower and upper bounds, respectively, for $\phi_{\min}(u)$ and
$\phi_{\max}(u)$ for all values of $u$. Choosing $e_n$ such that
$e_n^2 s_n = o(n)$, the condition (6) of incoherent design requires then
$e_n \phi_{\min}(e_n^2 s_n) \gg \phi_{\max}(s_n + \min\{n, p_n\})$. Using
$\phi_{\min}(e_n^2 s_n) \ge \phi^{\mathrm{block}}_{\min}$ and
$\phi_{\max} \le \phi^{\mathrm{block}}_{\max}$, it is sufficient if there
exists a sequence $e_n$ with
$\phi^{\mathrm{block}}_{\max} / \phi^{\mathrm{block}}_{\min} = o(e_n)$.
Together with the requirement $e_n^2 s_n = o(n)$, the condition of
incoherent design is fulfilled if, for $n \to \infty$,

(8)    $s_n = o(n / c_n^2),$

where the condition number $c_n$ is given by

(9)    $c_n := \phi^{\mathrm{block}}_{\max} / \phi^{\mathrm{block}}_{\min}.$

Under increasingly stronger assumptions on the sparsity, the condition
number $c_n$ can thus grow almost as fast as $\sqrt{n}$, while still
allowing for incoherent design.
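The blocked eigenvalues and the condition number (9) are straightforward to compute when the diagonal blocks are available. The following sketch assumes NumPy and symmetric positive definite blocks supplied as a list; the function name `block_condition_number` is hypothetical.

```python
import numpy as np


def block_condition_number(blocks):
    """Return (phi_min_block, phi_max_block, c_n) for a list of symmetric blocks.

    phi_min_block is the smallest eigenvalue over all blocks, phi_max_block
    the largest, and c_n their ratio as in (9). Assumes every block is
    positive definite so that the ratio is finite.
    """
    eigs = [np.linalg.eigvalsh(np.asarray(S, dtype=float)) for S in blocks]
    phi_min = min(e[0] for e in eigs)    # eigvalsh returns ascending values
    phi_max = max(e[-1] for e in eigs)
    return float(phi_min), float(phi_max), float(phi_max / phi_min)
```

For instance, a 2x2 block with correlation 0.5 together with a 1x1 block equal to 2 gives blocked eigenvalues 0.5 and 2 and hence $c_n = 4$; condition (8) then asks the sparsity to be of smaller order than $n/16$.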
2.2.2. More examples of incoherent designs. Consider two more examples of
incoherent design:

• The condition (6) of incoherent design is fulfilled if the minimal
eigenvalue of a selection of $s_n (\log n)^2$ variables is vanishing slowly
for $n \to \infty$ so that

    $\phi_{\min}\{s_n (\log n)^2\} \gg \frac{1}{\log n} \, \phi_{\max}(s_n + \min\{n, p_n\}).$

• The condition is also fulfilled if the minimal eigenvalue of a selection
of $n^\alpha s_n$ variables is vanishing slowly for $n \to \infty$ so that

    $\phi_{\min}(n^\alpha s_n) \gg n^{-\alpha/2} \phi_{\max}.$

These results can be derived from (6) by choosing the sparsity multiplier
sequences $e_n = \log n$ and $e_n = n^{\alpha/2}$, respectively. Some more
scenarios of incoherent design can be seen to satisfy (6).

2.2.3. Comparison with the uniform uncertainty principle. Candes and Tao
[5] use a Uniform Uncertainty Principle (UUP) to discuss the convergence of
the so-called Dantzig selector. The UUP can only be fulfilled if the
minimal eigenvalue of a selection of $s_n$ variables is bounded from below
by a constant, where $s_n$ is again the number of nonzero coefficients of
$\beta$. In the original version, a necessary condition for UUP is

(10)    $\phi_{\min}(s_n) + \phi_{\min}(2s_n) + \phi_{\min}(3s_n) > 2.$

At the same time, a bound on the maximal eigenvalue is a condition for the
UUP in [5],

(11)    $\phi_{\max}(s_n) + \phi_{\max}(2s_n) + \phi_{\max}(3s_n) < 4.$

This UUP condition is different from our incoherent design condition. In
some sense, the UUP is weaker than incoherent design, as the minimal
eigenvalues are calculated over only $3s_n$ variables. In another sense,
UUP is quite strong as it demands, in form (10) and assuming $s_n \ge 2$,
that all pairwise correlations between variables be less than 1/3! The
condition of incoherent design is weaker as the eigenvalue can be bounded
from below by an arbitrarily small constant (as opposed to the large value
implied by the UUP). Sparse eigenvalues can even converge slowly to zero in
our setting.

Taking the example of block designs from further above, incoherent design
allowed for the condition number (9) to grow almost as fast as $\sqrt{n}$.
In contrast, if the sparsity $s_n$ is larger than the maximal block-size,
the UUP requires that the condition number $c_n$ be bounded from above by a
positive constant. Using its form (10) and the corresponding bound (11) for
the maximal eigenvalue, it implies specifically that $c_n < 2$, which is
clearly stricter than the condition (8).
2.2.4. Incoherent designs and the irrepresentable condition. One might
ask in what sense the notion of incoherent design is more general than the
irrepresentable condition. At first, it might seem like we are simply
replacing the strict irrepresentable condition by a similarly strong
condition on the design matrix.

Consider first the classical case of a fixed number $p_n$ of variables. If
the covariance matrix $C = C_n$ is converging to a positive definite matrix
for large sample sizes, the design is automatically incoherent. On the
other hand, it is easy to violate the irrepresentable condition in this
case; for examples, see Zou [42].

The notion of incoherent designs is only a real restriction in the
high-dimensional case with $p_n > n$. Even then, it is clear that the
notion of incoherence is a relaxation of the irrepresentable condition, as
the irrepresentable condition can easily be violated even though all sparse
eigenvalues are bounded well away from zero.

2.3. Main result for high-dimensional data ($p_n > n$). We first state our
main result.

Theorem 1 (Convergence in $\ell_2$-norm). Assume the incoherent design
condition (6) with a sparsity multiplier sequence $e_n$. If
$\lambda \propto \sigma e_n \sqrt{n \log p_n}$, there exists a constant
$M > 0$ such that, with probability converging to 1 for $n \to \infty$,

(12)    $\|\beta - \hat\beta^\lambda\|_{\ell_2}^2 \le M \sigma^2 \, \frac{s_n \log p_n}{n} \, \frac{e_n^2}{\phi_{\min}^2(e_n^2 s_n)}.$

A proof is given in Section 3. It can be seen from the proofs that
nonasymptotic bounds could be obtained with essentially the same results.

If we choose the smallest possible multiplier sequence $e_n$, one obtains
not only the required lower bound
$e_n \ge 18 \phi_{\max} / \phi_{\min}(e_n^2 s_n)$ from (6) but also an
upper bound of the same form,
$e_n \le c \, \phi_{\max} / \phi_{\min}(e_n^2 s_n)$ for a numerical constant
$c$. Plugging this into (12) yields the probabilistic bound, for some
positive $M$,

    $\|\beta - \hat\beta^\lambda\|_{\ell_2}^2 \le M \sigma^2 \, \frac{s_n \log p_n}{n} \, \frac{\phi_{\max}^2}{\phi_{\min}^4(e_n^2 s_n)}.$

It is now easy to see that the convergence rate is essentially optimal as
long as the relevant eigenvalues are bounded.

Corollary 1. Assume that there exist constants
$0 < \kappa_{\min} \le \kappa_{\max} < \infty$ such that

(13)    $\liminf_{n \to \infty} \phi_{\min}(s_n \log n) \ge \kappa_{\min}$  and  $\limsup_{n \to \infty} \phi_{\max}(s_n + \min\{n, p_n\}) \le \kappa_{\max}.$

Then, for $\lambda \propto \sigma \sqrt{n \log p_n}$, there exists a
constant $M > 0$ such that, with probability converging to 1 for
$n \to \infty$,

    $\|\beta - \hat\beta^\lambda\|_{\ell_2}^2 \le M \sigma^2 \, \frac{s_n \log p_n}{n}.$

The proof of this follows from Theorem 1 by choosing a constant sparsity
multiplier sequence, for example, $20 \kappa_{\max} / \kappa_{\min}$.

The rate of convergence achieved is essentially optimal. Ignoring the
$\log p_n$ factor, it corresponds to the rate that could be achieved with
maximum likelihood estimation if the true underlying sparse model were
known.

It is perhaps also worthwhile to make a remark about the penalty parameter
sequence $\lambda$ and its, maybe unusual, reliance on the sparsity
multiplier sequence $e_n$. If both the relevant minimal and maximal sparse
eigenvalues in (6) are bounded from below and above, as in Corollary 1
above, the sequence $e_n$ is simply a constant. Any deviation from the
usually optimal sequence $\lambda \propto \sigma \sqrt{n \log p_n}$ occurs
thus only if the minimal sparse eigenvalues are decaying to zero for
$n \to \infty$, in which case the penalty parameter is increased slightly.
The value of $\lambda$ can be computed, in theory, without knowledge about
the true $\beta$. Doing so in practice would not be a trivial task,
however, as the sparse eigenvalues would have to be known. Moreover, the
noise level $\sigma$ would have to be estimated from data, a difficult task
for high-dimensional data with $p_n > n$. From a practical perspective, we
mostly see the results as implying that the $\ell_2$-distance can be small
for some value of the penalty parameter $\lambda$ along the solution path.
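To give a feeling for the orders of magnitude involved, the sketch below evaluates the penalty level $\sigma e_n \sqrt{n \log p_n}$ and the rate $\sigma^2 s_n \log p_n / n$ of Corollary 1 for given problem sizes. The function names `penalty_level` and `l2_rate` are hypothetical, and the proportionality constant `const` as well as the constant $M$ in the bound are unknown here and simply set aside; the numbers are therefore only indicative.

```python
import math


def penalty_level(sigma, n, p_n, e_n=1.0, const=1.0):
    """Penalty of order sigma * e_n * sqrt(n log p_n).

    `const` stands in for the unspecified proportionality constant.
    """
    return const * sigma * e_n * math.sqrt(n * math.log(p_n))


def l2_rate(sigma, s_n, n, p_n):
    """Rate sigma^2 * s_n * log(p_n) / n, i.e. the bound of Corollary 1 without M."""
    return sigma ** 2 * s_n * math.log(p_n) / n
```

For example, with $n = 100$, $p_n = 1000$, $s_n = 5$ and $\sigma = 1$, `l2_rate` returns roughly 0.35, while doubling the sample size halves it.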

2.4. Number of selected variables. As a result of separate interest, it is
perhaps noteworthy that bounds on the number of selected variables are
derived for the proof of Theorem 1. For the setting of Corollary 1 above,
where a constant sparsity multiplier can be chosen, Lemma 5 implies that,
with high probability, at most $O(s_n)$ variables are selected by the Lasso
estimator. The selected subset is hence of the same order of magnitude as
the set of "truly nonzero" coefficients. In general, with high probability,
no more than $e_n^2 s_n$ variables are selected.

2.5. Sign consistency with two-step procedures. It follows from our
results above that the Lasso estimator can be modified to be sign
consistent in a two-step procedure even if the irrepresentable condition is
relaxed. All one needs is the assumption that nonzero coefficients of
$\beta$ are "sufficiently" large. One possibility is hard-thresholding of
the obtained coefficients, neglecting variables with very small
coefficients. This effect has already been observed empirically in [33].
Other possibilities include soft-thresholding and relaxation methods such
as the Gauss-Dantzig selector [5], the relaxed Lasso [22] with an
additional thresholding step or the adaptive Lasso of Zou [42].
Definition 5 (Hard-thresholded Lasso estimator). Let, for each
$x \in \mathbb{R}^p$, the quantity $1\{|x| \ge c\}$ be a $p_n$-dimensional
vector which is, componentwise, equal to 1 if $|x_k| \ge c$ and 0
otherwise. For a given sequence $t_n$, the hard-thresholded Lasso estimator
$\hat\beta^{ht,\lambda}$ is defined as

    $\hat\beta^{ht,\lambda} = \hat\beta^\lambda \, 1\{|\hat\beta^\lambda| \ge \sigma t_n \sqrt{\log p_n / n}\}.$
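The thresholding step of Definition 5 amounts to a single componentwise operation once a Lasso estimate is available. The sketch below assumes NumPy and that the noise level $\sigma$ and the sequence value $t_n$ are supplied by the user; the function name `hard_threshold_lasso` is hypothetical.

```python
import numpy as np


def hard_threshold_lasso(beta_hat, sigma, t_n, n, p_n):
    """Zero out entries of a Lasso estimate below sigma * t_n * sqrt(log(p_n) / n).

    Entries whose absolute value is at least the cut-off are kept unchanged.
    sigma and t_n are assumed to be given (in practice sigma is unknown).
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    cutoff = sigma * t_n * np.sqrt(np.log(p_n) / n)
    return beta_hat * (np.abs(beta_hat) >= cutoff)
```

With $n = 100$, $p_n = 1000$ and $\sigma = t_n = 1$ the cut-off is about 0.26, so an estimated coefficient of 0.1 is set to zero while one of 0.3 is retained.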

The sequence $t_n$ can be chosen freely. We start with a corollary that
follows directly from Theorem 1, stating that the hard-thresholded Lasso
estimator (unlike the un-thresholded estimator) is sign consistent under
regularity assumptions that are weaker than the irrepresentable condition
needed for sign consistency of the ordinary Lasso estimator.

Corollary 2 (Sign consistency by hard thresholding). Assume the
incoherent design assumption (6) holds and the sparsity of $\beta$ fulfills
$s_n = o(t_n^2 e_n^{-4})$ for $n \to \infty$. Assume furthermore

    $\min_{k \in K} |\beta_k| \gg \sigma t_n \sqrt{\log p_n / n}, \qquad n \to \infty.$

Under a choice $\lambda \propto \sigma e_n \sqrt{n \log p_n}$, the
hard-thresholded Lasso estimator of Definition 5 is then sign consistent
and

    $P\{\mathrm{sign}(\hat\beta^{ht,\lambda}) = \mathrm{sign}(\beta)\} \to 1$  as $n \to \infty$.

The proof follows from the results of Theorem 1. The bound (12) on the
$\ell_2$-distance, derived from Theorem 1, gives then trivially the
identical bound on the squared $\ell_\infty$-distance between
$\hat\beta^\lambda$ and $\beta$. The result follows by observing that
$1/\phi_{\max} = O(1)$ and that the error is of smaller order than the
lower bound on the size of nonzero $\beta$'s, due to the assumptions of
incoherent design and $s_n = o(t_n^2 e_n^{-4})$. When choosing a suitable
value of the cut-off parameter $t_n$, one is faced with a trade-off.
Choosing larger values of the cut-off $t_n$ places a stricter condition on
the minimal nonzero value of $\beta$, while smaller values of $t_n$ relax
this assumption, yet require the vector $\beta$ to be sparser.

The result mainly implies that sign consistency can be achieved with the
hard-thresholded Lasso estimator under much weaker requirements than with
the ordinary Lasso estimator. As discussed previously, the ordinary Lasso
estimator is only sign consistent if the irrepresentable condition or,
equivalently, neighborhood stability is fulfilled [23, 41, 42]. This is a
considerably stronger assumption than the incoherence assumption above. In
either case, a similar assumption on the rate of decay of the minimal
nonzero components is needed.

In conclusion, even though one cannot achieve sign consistency in gen- 
eral with just a single Lasso estimation, it can be achieved in a two-stage 
procedure. 
3. Proof of Theorem 1. Let $\beta^\lambda$ be the estimator under the
absence of noise, that is, $\beta^\lambda = \hat\beta^{\lambda,0}$, where
$\hat\beta^{\lambda,\xi}$ is defined as in (15). The $\ell_2$-distance can
then be bounded by
$\|\hat\beta^\lambda - \beta\|_{\ell_2}^2 \le 2\|\hat\beta^\lambda - \beta^\lambda\|_{\ell_2}^2 + 2\|\beta^\lambda - \beta\|_{\ell_2}^2$.
The first term on the right-hand side represents the variance of the
estimation, while the second term represents the bias. The bias
contribution follows directly from Lemma 2 below. The bound on the variance
term follows by Lemma 6 below.


De-noised response. Before starting, it is useful to define a de-noised
response. Define for $0 \le \xi \le 1$ the de-noised version of the
response variable,

(14)    $Y(\xi) = X\beta + \xi\varepsilon.$

We can regulate the amount of noise with the parameter $\xi$. For
$\xi = 0$, only the signal is retained. The original observations with the
full amount of noise are recovered for $\xi = 1$. Now consider for
$0 \le \xi \le 1$ the estimator $\hat\beta^{\lambda,\xi}$,

(15)    $\hat\beta^{\lambda,\xi} = \arg\min_{\beta} \|Y(\xi) - X\beta\|_{\ell_2}^2 + \lambda \|\beta\|_{\ell_1}.$

The ordinary Lasso estimate is recovered under the full amount of noise so
that $\hat\beta^{\lambda,1} = \hat\beta^\lambda$. Using the notation from
the previous results, we can write for the estimate in the absence of
noise, $\hat\beta^{\lambda,0} = \beta^\lambda$. The definition of the
de-noised version of the Lasso estimator will be helpful for the proof as
it allows one to characterize the variance of the estimator.



3.1. Part I of proof: bias. Let $K$ be the set of nonzero elements of
$\beta$, that is, $K = \{k : \beta_k \neq 0\}$. The cardinality of $K$ is
again denoted by $s = s_n$. For the following, let $\beta^\lambda$ be the
estimator under the absence of noise, as defined in (15). The solution
$\beta^\lambda$ can, for each value of $\lambda$, be written as
$\beta^\lambda = \beta + \gamma^\lambda$, where

(16)    $\gamma^\lambda = \arg\min_{\zeta \in \mathbb{R}^p} f(\zeta).$

The function $f(\zeta)$ is given by

(17)    $f(\zeta) = n \zeta^T C \zeta + \lambda \sum_{k \in K^c} |\zeta_k| + \lambda \sum_{k \in K} (|\beta_k + \zeta_k| - |\beta_k|).$

The vector $\gamma^\lambda$ is the bias of the Lasso estimator. We derive
first a bound on the $\ell_2$-norm of $\gamma^\lambda$.

Lemma 1. Assume incoherent design as in (6) with a sparsity multiplier
sequence $e_n$. The $\ell_2$-norm of $\gamma^\lambda$, as defined in (16), is then bounded
for sufficiently large values of $n$ by

(18)   $\|\gamma^\lambda\|_{\ell_2} \le 17.5\,\frac{\lambda}{n}\,\frac{\sqrt{s_n}}{\phi_{\min}(e_n s_n)}.$

Proof. We write in the following $\gamma$ instead of $\gamma^\lambda$ for notational
simplicity. Let $\gamma(K)$ be the vector with coefficients
$\gamma_k(K) = \gamma_k 1\{k \in K\}$, that is, $\gamma(K)$ is the bias of the truly nonzero
coefficients. Analogously, let $\gamma(K^c)$ be the bias of the truly zero
coefficients with $\gamma_k(K^c) = \gamma_k 1\{k \notin K\}$. Clearly,
$\gamma = \gamma(K) + \gamma(K^c)$. The value of the function $f(\zeta)$, as defined in (17),
is 0 if setting $\zeta = 0$. For the true solution $\gamma^\lambda$, it follows hence that
$f(\gamma^\lambda) \le 0$. Hence, using that $\zeta^T C\zeta \ge 0$ for any $\zeta$,

(19)   $\|\gamma(K^c)\|_{\ell_1} = \sum_{k \in K^c} |\gamma_k| \le \Big|\sum_{k \in K} (|\beta_k + \gamma_k| - |\beta_k|)\Big| \le \|\gamma(K)\|_{\ell_1}.$

As $\|\gamma(K)\|_{\ell_0} \le s_n$, it follows that
$\|\gamma(K)\|_{\ell_1} \le \sqrt{s_n}\,\|\gamma(K)\|_{\ell_2} \le \sqrt{s_n}\,\|\gamma\|_{\ell_2}$ and
hence, using (19),

(20)   $\|\gamma\|_{\ell_1} \le 2\sqrt{s_n}\,\|\gamma\|_{\ell_2}.$

This result will be used further below. We use now again that
$f(\gamma^\lambda) \le 0$ [as $\zeta = 0$ yields the upper bound $f(\zeta) = 0$]. Using the
previous result that $\|\gamma(K)\|_{\ell_1} \le \sqrt{s_n}\,\|\gamma\|_{\ell_2}$, and ignoring the
nonnegative term $\|\gamma(K^c)\|_{\ell_1}$, it follows that

(21)   $n\gamma^T C\gamma \le \lambda\sqrt{s_n}\,\|\gamma\|_{\ell_2}.$

Consider now the term $\gamma^T C\gamma$. Bounding this term from below and
plugging the result into (21) will yield the desired upper bound on the
$\ell_2$-norm of $\gamma$. Let $|\gamma_{(1)}| \ge |\gamma_{(2)}| \ge \cdots \ge |\gamma_{(p)}|$ be the
ordered entries of $\gamma$.

Let $u_n$ for $n \in \mathbb{N}$ be a sequence of positive integers, to be chosen later,
and define the set of the "$u_n$-largest coefficients" as
$U = \{k : |\gamma_k| \ge |\gamma_{(u_n)}|\}$. Define analogously to above the vectors
$\gamma(U)$ and $\gamma(U^c)$ by $\gamma_k(U) = \gamma_k 1\{k \in U\}$ and
$\gamma_k(U^c) = \gamma_k 1\{k \notin U\}$. The quantity $\gamma^T C\gamma$ can be written as
$\gamma^T C\gamma = \|a + b\|_{\ell_2}^2$, where $a := n^{-1/2} X\gamma(U)$ and
$b := n^{-1/2} X\gamma(U^c)$. Then

(22)   $\gamma^T C\gamma = \|a + b\|_{\ell_2}^2 \ge (\|a\|_{\ell_2} - \|b\|_{\ell_2})^2.$

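Before proceeding, we need to bound the norm $\|\gamma(U^c)\|_{\ell_2}$ as a function
of $u_n$. Assume for the moment that the $\ell_1$-norm $\|\gamma\|_{\ell_1}$ is identical to
some $\ell > 0$. Then it holds for every $k = 1, \ldots, p$ that
$|\gamma_{(k)}| \le \ell/k$. Hence,

(23)   $\|\gamma(U^c)\|_{\ell_2}^2 \le \|\gamma\|_{\ell_1}^2 \sum_{k=u_n+1}^{p} \frac{1}{k^2} \le (4 s_n \|\gamma\|_{\ell_2}^2)\,\frac{1}{u_n},$

having used the result (20) from above that
$\|\gamma\|_{\ell_1} \le 2\sqrt{s_n}\,\|\gamma\|_{\ell_2}$. As $\gamma(U)$ has by definition only $u_n$
nonzero coefficients,

(24)   $\|a\|_{\ell_2}^2 = \gamma(U)^T C\gamma(U) \ge \phi_{\min}(u_n)\|\gamma(U)\|_{\ell_2}^2 \ge \phi_{\min}(u_n)\Big(1 - \frac{4 s_n}{u_n}\Big)\|\gamma\|_{\ell_2}^2,$

having used (23) and $\|\gamma(U)\|_{\ell_2}^2 = \|\gamma\|_{\ell_2}^2 - \|\gamma(U^c)\|_{\ell_2}^2$. As
$\gamma(U^c)$ has at most $\min\{n, p\}$ nonzero coefficients and using again (23),

(25)   $\|b\|_{\ell_2}^2 = \gamma(U^c)^T C\gamma(U^c) \le \phi_{\max}\|\gamma(U^c)\|_{\ell_2}^2 \le \phi_{\max}\,\frac{4 s_n}{u_n}\,\|\gamma\|_{\ell_2}^2.$

Using (24) and (25) in (22), together with $\phi_{\max} \ge \phi_{\min}(u_n)$,

(26)   $\gamma^T C\gamma \ge \phi_{\min}(u_n)\|\gamma\|_{\ell_2}^2 \Big(1 - 4\sqrt{\frac{s_n \phi_{\max}}{u_n \phi_{\min}(u_n)}}\Big).$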

Choosing for $u_n$ the sparsity multiplier sequence, as defined in (6), times
the sparsity $s_n$, so that $u_n = e_n s_n$, it holds that
$s_n \phi_{\max} / (e_n s_n \phi_{\min}(e_n^2 s_n)) \le 1/18$ and hence also that
$s_n \phi_{\max} / (e_n s_n \phi_{\min}(e_n s_n)) \le 1/18$, since
$\phi_{\min}(e_n^2 s_n) \le \phi_{\min}(e_n s_n)$. Thus the right-hand side in (26) is
bounded from below by $\phi_{\min}(e_n s_n)\|\gamma\|_{\ell_2}^2 / 17.5$, since
$1 - 4/\sqrt{18} \ge 1/17.5$. Using the last result together with (21), which
says that $\gamma^T C\gamma \le n^{-1}\lambda\sqrt{s_n}\,\|\gamma\|_{\ell_2}$, it follows that for
large $n$,

$\|\gamma\|_{\ell_2} \le 17.5\,\frac{\lambda}{n}\,\frac{\sqrt{s_n}}{\phi_{\min}(e_n s_n)},$

which completes the proof. □
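
Two elementary facts carry the constants in this proof: the tail sum of $1/k^2$ beyond $u_n$ is at most $1/u_n$ (used in (23)), and $1 - 4/\sqrt{18}$ is just above $1/17.5$ (the origin of the constant 17.5). A short numerical check, offered only as a sanity sketch with arbitrarily chosen values of `u` and `p`:

```python
import math

def tail_sum_inverse_squares(u, p):
    """sum_{k=u+1}^{p} 1/k^2, the factor appearing in (23)."""
    return sum(1.0 / (k * k) for k in range(u + 1, p + 1))

# Tail bound used in (23): the sum is at most 1/u for any p > u.
for u in (1, 5, 50):
    print(u, tail_sum_inverse_squares(u, 10_000), 1.0 / u)

# With s_n * phi_max / (u_n * phi_min(u_n)) <= 1/18, the bracket in (26)
# is at least 1 - 4/sqrt(18), which has to be compared with 1/17.5.
bracket = 1.0 - 4.0 / math.sqrt(18.0)
print(bracket, 1.0 / 17.5)
```

The margin in the second comparison is small, which is why the constant cannot be lowered much below 17.5 with this argument.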

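Lemma 2. Under the assumptions of Theorem 1, the bias $\|\gamma^\lambda\|_{\ell_2}^2$ is
bounded by

$\|\gamma^\lambda\|_{\ell_2}^2 \le (17.5)^2 \sigma^2\,\frac{s_n \log p_n}{n}\,\frac{e_n^2}{\phi_{\min}^2(e_n^2 s_n)}.$

Proof. This is an immediate consequence of Lemma 1. Plugging the
penalty sequence $\lambda \propto \sigma e_n \sqrt{n \log p_n}$ into (18), the result follows by
the inequality $\phi_{\min}(e_n s_n) \ge \phi_{\min}(e_n^2 s_n)$, having used that, by its
definition in (6), $e_n$ is necessarily larger than 1. □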

3.2. Part II of proof: variance. The proof for the variance part needs 
two steps. First, a bound on the variance is derived, which is a function 
of the number of active variables. In a second step, the number of active 
variables will be bounded, taking into account also the bound on the bias 
derived above. 

Variance of restricted OLS. Before considering the Lasso estimator, a
trivial bound is shown for the variance of a restricted OLS estimation. Let
$\hat\theta^M \in \mathbb{R}^p$ be, for every subset $M \subseteq \{1, \ldots, p\}$ with $|M| \le n$, the
restricted OLS-estimator of the noise vector $\varepsilon$,

(27)   $\hat\theta^M = (X_M^T X_M)^{-1} X_M^T \varepsilon.$

First, we bound the $\ell_2$-norm of this estimator. The result is useful for
bounding the variance of the final estimator, based on the derived bound on
the number of active variables.
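
As an illustration of (27), the sketch below computes the restricted OLS estimator of a noise vector on a subset $M$ and embeds it in $\mathbb{R}^p$ with zeros outside $M$. The tiny design matrix, the noise vector and the helper names (`solve`, `restricted_ols`) are assumptions made for this example; the normal equations are solved by plain Gaussian elimination to keep the code dependency-free.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(A)
    aug = [list(A[i]) + [b[i]] for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, m):
            f = aug[r][c] / aug[c][c]
            for j in range(c, m + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = aug[r][m] - sum(aug[r][j] * x[j] for j in range(r + 1, m))
        x[r] = s / aug[r][r]
    return x

def restricted_ols(X, eps, M):
    """theta^M = (X_M^T X_M)^{-1} X_M^T eps, with zeros outside M."""
    n, p = len(X), len(X[0])
    M = sorted(M)
    gram = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in M] for a in M]
    rhs = [sum(X[i][a] * eps[i] for i in range(n)) for a in M]
    coef = solve(gram, rhs)
    theta = [0.0] * p
    for k, c in zip(M, coef):
        theta[k] = c
    return theta

# Illustrative numbers (assumed, not from the paper).
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [1.0, 1.0, 0.0],
     [2.0, 0.0, 1.0]]
eps = [1.0, -1.0, 0.5, 2.0]
M = {0, 2}

theta = restricted_ols(X, eps, M)
resid = [e - sum(x * t for x, t in zip(row, theta)) for row, e in zip(X, eps)]
# Normal equations: the residual is orthogonal to every column in M.
corr = {k: sum(X[i][k] * resid[i] for i in range(len(X))) for k in sorted(M)}
print(theta, corr)
```

The orthogonality of the residual $\varepsilon - X\hat\theta^M$ to the columns in $M$ is the property that the later path argument relies on.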

Lemma 3. Let $m_n$ be a sequence with $m_n = o(n)$ and $m_n \to \infty$ for
$n \to \infty$. If $p_n \to \infty$, it holds with probability converging to 1 for
$n \to \infty$

$\max_{M : |M| \le m_n} \|\hat\theta^M\|_{\ell_2}^2 \le \frac{2 \log p_n}{n}\,\frac{m_n}{\phi_{\min}^2(m_n)}\,\sigma^2.$

The $\ell_2$-norm of the restricted estimator $\hat\theta^M$ is thus bounded uniformly
over all sets $M$ with $|M| \le m_n$.

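Proof of Lemma 3. It follows directly from the definition of $\hat\theta^M$ that,
for every $M$ with $|M| \le m_n$,

(28)   $\|\hat\theta^M\|_{\ell_2}^2 \le \frac{1}{n^2 \phi_{\min}^2(m_n)}\,\|X_M^T \varepsilon\|_{\ell_2}^2.$

It remains to be shown that, for $n \to \infty$, with probability converging to 1,

$\max_{M : |M| \le m_n} \|X_M^T \varepsilon\|_{\ell_2}^2 \le 2 \log p_n\, \sigma^2 m_n n.$

As $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ for all $i = 1, \ldots, n$, it holds with probability
converging to 1 for $n \to \infty$, by Bonferroni's inequality, that
$\max_{k \le p_n} |X_k^T \varepsilon|^2$ is bounded from above by $2 \log p_n\, \sigma^2 n$. Hence,
with probability converging to 1 for $n \to \infty$,

(29)   $\max_{M : |M| \le m_n} \|X_M^T \varepsilon\|_{\ell_2}^2 \le m_n \max_{k \le p_n} |X_k^T \varepsilon|^2 \le 2 \log p_n\, \sigma^2 n m_n,$

which completes the proof. □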

Variance of estimate is bounded by restricted OLS variance. We show 
that the variance of the Lasso estimator can be bounded by the variances of 
restricted OLS estimators, using bounds on the number of active variables. 

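Lemma 4. If, for a fixed value of $\lambda$, the number of active variables of
the de-noised estimators $\hat\beta^{\lambda,\xi}$ is for every $0 \le \xi \le 1$ bounded by $m$, then

(30)   $\sup_{0 \le \xi \le 1} \|\hat\beta^{\lambda,0} - \hat\beta^{\lambda,\xi}\|_{\ell_2}^2 \le \max_{M : |M| \le m} \|\hat\theta^M\|_{\ell_2}^2.$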

Proof. The key in the proof is that the solution path of $\hat\beta^{\lambda,\xi}$, if
increasing the value of $\xi$ from 0 to 1, can be expressed piecewise in terms of
the restricted OLS solution. It will be obvious from the proof that it is
sufficient to show the claim for $\xi = 1$ in the term on the left-hand side of
(30).

The set $M(\xi)$ of active variables is the set with maximal absolute gradient,

$M(\xi) = \{k : |G_k^{\lambda,\xi}| = \lambda\}.$

Note that the estimator $\hat\beta^{\lambda,\xi}$ and also the gradient $G^{\lambda,\xi}$ are
continuous functions in both $\lambda$ and $\xi$ [11]. Let
$0 = \xi_1 < \xi_2 < \cdots < \xi_{J+1} = 1$ be the points of discontinuity of $M(\xi)$.
At these locations, variables either join the active set or are dropped from
the active set.

Fix some $j$ with $1 \le j \le J$. Denote by $M_j$ the set of active variables
$M(\xi)$ for any $\xi \in (\xi_j, \xi_{j+1})$. We show in the following that the solution
$\hat\beta^{\lambda,\xi}$ is for all $\xi$ in the interval $(\xi_j, \xi_{j+1})$ given by

(31)   $\forall \xi \in (\xi_j, \xi_{j+1}) : \hat\beta^{\lambda,\xi} = \hat\beta^{\lambda,\xi_j} + (\xi - \xi_j)\,\hat\theta^{M_j},$

where $\hat\theta^{M_j}$ is the restricted OLS estimator of noise, as defined in (27).
The local effect of increased noise (larger value of $\xi$) on the estimator is
thus to shift the coefficients of the active set of variables along the least
squares direction.

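Once (31) is shown, the claim follows by piecing together the piecewise
linear parts and using continuity of the solution as a function of $\xi$ to
obtain

$\|\hat\beta^{\lambda,0} - \hat\beta^{\lambda,1}\|_{\ell_2} \le \sum_{j=1}^{J} \|\hat\beta^{\lambda,\xi_j} - \hat\beta^{\lambda,\xi_{j+1}}\|_{\ell_2} \le \max_{M : |M| \le m} \|\hat\theta^M\|_{\ell_2} \sum_{j=1}^{J} (\xi_{j+1} - \xi_j) = \max_{M : |M| \le m} \|\hat\theta^M\|_{\ell_2}.$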

It thus remains to show (31). A necessary and sufficient condition for
$\hat\beta^{\lambda,\xi}$ with $\xi \in (\xi_j, \xi_{j+1})$ to be a valid solution is that for all
$k \in M_j$ with nonzero coefficient $\hat\beta_k^{\lambda,\xi} \neq 0$, the gradient is equal to
$\lambda$ times the negative sign,

(32)   $G_k^{\lambda,\xi} = -\lambda\,\mathrm{sign}(\hat\beta_k^{\lambda,\xi}),$

that for all variables $k \in M_j$ with zero coefficient $\hat\beta_k^{\lambda,\xi} = 0$ the
gradient is equal in absolute value to $\lambda$,

(33)   $|G_k^{\lambda,\xi}| = \lambda,$

and for variables $k \notin M_j$ not in the active set,

(34)   $|G_k^{\lambda,\xi}| < \lambda.$

These conditions are a consequence of the requirement that the subgradient
of the loss function contains 0 for a valid solution.

Note that the gradient of the active variables in $M_j$ is unchanged if
replacing $\xi$ by some $\xi' \in (\xi_j, \xi_{j+1})$ and replacing $\hat\beta^{\lambda,\xi}$ by
$\hat\beta^{\lambda,\xi} + (\xi' - \xi)\,\hat\theta^{M_j}$. That is, for all $k \in M_j$,

$(Y(\xi) - X\hat\beta^{\lambda,\xi})^T X_k = \{Y(\xi') - X(\hat\beta^{\lambda,\xi} + (\xi' - \xi)\,\hat\theta^{M_j})\}^T X_k,$

as the difference of both sides is equal to
$(\xi' - \xi)\{(\varepsilon - X\hat\theta^{M_j})^T X_k\}$, and $(\varepsilon - X\hat\theta^{M_j})^T X_k = 0$ for all
$k \in M_j$, as $\hat\theta^{M_j}$ is the OLS of $\varepsilon$, regressed on the variables in $M_j$.
Equalities (32) and (33) are thus fulfilled for the solution and it remains to
show that (34) also holds. For sufficiently small values of $\xi' - \xi$,
inequality (34) is clearly fulfilled for continuity reasons. Note that if
$|\xi' - \xi|$ is large enough such that for one variable $k \notin M_j$ inequality
(34) becomes an equality, then the set of active variables changes and thus
either $\xi' = \xi_{j+1}$ or $\xi' = \xi_j$. We have thus shown that the solution
$\hat\beta^{\lambda,\xi}$ can for all $\xi \in (\xi_j, \xi_{j+1})$ be written as

$\hat\beta^{\lambda,\xi} = \hat\beta^{\lambda,\xi_j} + (\xi - \xi_j)\,\hat\theta^{M_j},$

which proves (31) and thus completes the proof. □
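
The optimality conditions (32) to (34) used in this proof can be verified on a toy problem. The sketch below assumes a design with orthonormal columns, for which the minimizer of $\|y - X\beta\|_{\ell_2}^2 + \lambda\|\beta\|_{\ell_1}$ is given in closed form by soft-thresholding $X^T y$ at $\lambda/2$; it also assumes the gradient convention $G = -2X^T(y - X\beta)$. The data and the function names are illustrative, not taken from the paper.

```python
def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_orthonormal(X, y, lam):
    """Closed-form Lasso solution when the columns of X are orthonormal."""
    n, p = len(X), len(X[0])
    z = [sum(X[i][k] * y[i] for i in range(n)) for k in range(p)]
    return [soft_threshold(zk, lam / 2.0) for zk in z]

def gradient(X, y, beta):
    """G = -2 X^T (y - X beta), the gradient of the squared-error loss."""
    n, p = len(X), len(X[0])
    resid = [y[i] - sum(X[i][k] * beta[k] for k in range(p)) for i in range(n)]
    return [-2.0 * sum(X[i][k] * resid[i] for i in range(n)) for k in range(p)]

def classify(G, beta, lam, tol=1e-9):
    """Label each variable by the optimality condition it satisfies."""
    labels = []
    for g, b in zip(G, beta):
        if b != 0.0:
            sign = 1.0 if b > 0 else -1.0
            if abs(g + lam * sign) > tol:          # condition (32)
                raise ValueError("condition (32) violated")
            labels.append("selected")
        elif abs(abs(g) - lam) <= tol:             # condition (33)
            labels.append("active, zero coefficient")
        elif abs(g) < lam:                         # condition (34)
            labels.append("inactive")
        else:
            raise ValueError("condition (34) violated")
    return labels

# Three orthonormal columns in R^4 (assumed toy design).
X = [[0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.5, 0.5, -0.5],
     [0.5, -0.5, -0.5]]
y = [2.0, 1.0, 1.2, -0.2]
lam = 2.2

beta = lasso_orthonormal(X, y, lam)
labels = classify(gradient(X, y, beta), beta, lam)
print([round(b, 6) for b in beta], labels)
```

Here $X^T y = (2, 1.2, 1)$ and the threshold is $\lambda/2 = 1.1$, so the first two variables are selected and the third stays inactive with $|G_3| = 2 < \lambda$.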

A bound on the number of active variables. A decisive part in the vari- 
ance of the estimator is determined by the number of selected variables. 
Instead of directly bounding the number of selected variables, we derive 
bounds for the number of active variables. As any variable with a nonzero 
regression coefficient is also an active variable, these bounds lead trivially 
to bounds for the number of selected variables. 

Let $\mathcal{A}_\lambda$ be the set of active variables,

$\mathcal{A}_\lambda = \{k : |G_k^\lambda| = \lambda\}.$

Let $\mathcal{A}_{\lambda,\xi}$ be the set of active variables of the de-noised estimator
$\hat\beta^{\lambda,\xi}$, as defined in (15). The number of selected variables (variables
with a nonzero coefficient) is at most as large as the number of active
variables, as any variable with a nonzero estimated coefficient has to be an
active variable [25].

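Lemma 5. For $\lambda \ge \sigma e_n \sqrt{n \log p_n}$, the maximal number
$\sup_{0 \le \xi \le 1} |\mathcal{A}_{\lambda,\xi}|$ of active variables is bounded, with probability
converging to 1 for $n \to \infty$, by

$\sup_{0 \le \xi \le 1} |\mathcal{A}_{\lambda,\xi}| \le e_n^2 s_n.$

Proof. Let $R^{\lambda,\xi}$ be the residuals of the de-noised estimator (15),
$R^{\lambda,\xi} = Y(\xi) - X\hat\beta^{\lambda,\xi}$. For any $k$ in the $|\mathcal{A}_{\lambda,\xi}|$-dimensional
space spanned by the active variables,

(35)   $|X_k^T R^{\lambda,\xi}| = \lambda.$

Adding up, it follows that for all $0 \le \xi \le 1$,

(36)   $|\mathcal{A}_{\lambda,\xi}|\,\lambda^2 = \|X_{\mathcal{A}_{\lambda,\xi}}^T R^{\lambda,\xi}\|_{\ell_2}^2.$

The residuals can for all values $0 \le \xi \le 1$ be written as the sum of two
terms, $R^{\lambda,\xi} = X(\beta - \hat\beta^{\lambda,\xi}) + \xi\varepsilon$. Equality (36) can now be
transformed into the inequality,

(37)   $|\mathcal{A}_{\lambda,\xi}|\,\lambda^2 \le \big(\|X_{\mathcal{A}_{\lambda,\xi}}^T X(\beta - \hat\beta^{\lambda,\xi})\|_{\ell_2} + \xi\|X_{\mathcal{A}_{\lambda,\xi}}^T \varepsilon\|_{\ell_2}\big)^2$

(38)   $\le \big(\|X_{\mathcal{A}_{\lambda,\xi}}^T X(\beta - \hat\beta^{\lambda,\xi})\|_{\ell_2} + \|X_{\mathcal{A}_{\lambda,\xi}}^T \varepsilon\|_{\ell_2}\big)^2.$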



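Denote by $\hat m$ the supremum of $|\mathcal{A}_{\lambda,\xi}|$ over all values of
$0 \le \xi \le 1$. Using the same argument as in the derivation of (29), the term
$\sup_{0 \le \xi \le 1} \|X_{\mathcal{A}_{\lambda,\xi}}^T \varepsilon\|_{\ell_2}^2$ is of order
$o_p(\hat m\, n \log p_n)$ as long as $p_n \to \infty$ for $n \to \infty$. For sufficiently
large $n$ it holds thus, using $\lambda \ge \sigma e_n \sqrt{n \log p_n}$, that
$\sup_{0 \le \xi \le 1} \|X_{\mathcal{A}_{\lambda,\xi}}^T \varepsilon\|_{\ell_2} / (\hat m \lambda^2)^{1/2} \le \eta$ for any
$\eta > 0$. Dividing by $\lambda^2$, (37) implies then, with probability converging
to 1,

(39)   $\hat m \le \sup_{0 \le \xi \le 1} \big(\lambda^{-1}\|X_{\mathcal{A}_{\lambda,\xi}}^T X(\beta - \hat\beta^{\lambda,\xi})\|_{\ell_2} + \eta\sqrt{\hat m}\big)^2.$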

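Now turning to the right-hand side, it trivially holds for any value of
$0 \le \xi \le 1$ that $|\mathcal{A}_{\lambda,\xi}| \le \min\{n, p\}$. On the other hand,
$X(\beta - \hat\beta^{\lambda,\xi}) = X_{B_{\lambda,\xi}}(\beta - \hat\beta^{\lambda,\xi})$, where
$B_{\lambda,\xi} := \mathcal{A}_{\lambda,\xi} \cup \{k : \beta_k \neq 0\}$, as the difference vector
$\beta - \hat\beta^{\lambda,\xi}$ has nonzero entries only in the set $B_{\lambda,\xi}$. Thus

$\|X_{\mathcal{A}_{\lambda,\xi}}^T X(\beta - \hat\beta^{\lambda,\xi})\|_{\ell_2}^2 \le \|X_{B_{\lambda,\xi}}^T X_{B_{\lambda,\xi}}(\beta - \hat\beta^{\lambda,\xi})\|_{\ell_2}^2.$

Using additionally $|B_{\lambda,\xi}| \le s_n + \min\{n, p\}$, it follows that

$\|X_{\mathcal{A}_{\lambda,\xi}}^T X(\beta - \hat\beta^{\lambda,\xi})\|_{\ell_2}^2 \le n^2 \phi_{\max}^2 \|\beta - \hat\beta^{\lambda,\xi}\|_{\ell_2}^2.$

Splitting the difference $\beta - \hat\beta^{\lambda,\xi}$ into
$(\beta - \beta^\lambda) + (\beta^\lambda - \hat\beta^{\lambda,\xi})$, where $\beta^\lambda = \hat\beta^{\lambda,0}$ is again the
population version of the Lasso estimator, it holds for any $\eta > 0$, using
(39), that with probability converging to 1 for $n \to \infty$,

(40)   $\hat m \le \Big(n\lambda^{-1}\phi_{\max}\|\beta - \beta^\lambda\|_{\ell_2} + n\lambda^{-1}\phi_{\max}\sup_{0 \le \xi \le 1}\|\hat\beta^{\lambda,0} - \hat\beta^{\lambda,\xi}\|_{\ell_2} + \eta\sqrt{\hat m}\Big)^2.$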

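Using Lemmas 3 and 4, the variance term
$n^2 \phi_{\max}^2 \sup_{0 \le \xi \le 1} \|\hat\beta^{\lambda,0} - \hat\beta^{\lambda,\xi}\|_{\ell_2}^2$ is bounded by
$o_p\{n \hat m \log p_n\, \phi_{\max}^2 / \phi_{\min}^2(\hat m)\}$. Define, implicitly, a sequence
$\tilde\lambda = \sigma\sqrt{n \log p_n}\,(\phi_{\max} / \phi_{\min}(\hat m))$. For any sequence $\lambda$ with
$\liminf_{n \to \infty} \lambda/\tilde\lambda > 0$, the term
$n^2 \lambda^{-2} \phi_{\max}^2 \sup_{0 \le \xi \le 1} \|\hat\beta^{\lambda,0} - \hat\beta^{\lambda,\xi}\|_{\ell_2}^2$ is then of
order $o_p(\hat m)$. Using furthermore the bound on the bias from Lemma 1, it
holds with probability converging to 1, for $n \to \infty$, for any sequence $\lambda$
with $\liminf_{n \to \infty} \lambda/\tilde\lambda > 0$ and any $\eta > 0$ that

$\hat m \le \big(n\lambda^{-1}\phi_{\max}\|\beta - \beta^\lambda\|_{\ell_2} + 2\eta\sqrt{\hat m}\big)^2 \le \Big(17.5\,\phi_{\max}\frac{\sqrt{s_n}}{\phi_{\min}(e_n s_n)} + 2\eta\sqrt{\hat m}\Big)^2.$

Choosing $\eta = 0.013$ implies, for an inequality of the form
$a^2 \le (x + 2\eta a)^2$, that $a \le (18/17.5)x$. Hence, choosing this value of
$\eta$, it follows from the equation above that, with probability converging
to 1 for $n \to \infty$,

$\hat m \le 18^2\,\frac{\phi_{\max}^2 s_n}{\phi_{\min}^2(e_n s_n)} = e_n^2 s_n \Big(\frac{18\,\phi_{\max}}{e_n \phi_{\min}(e_n s_n)}\Big)^2 \le e_n^2 s_n,$

having used the definition of the sparsity multiplier in (6). We can now
see that the requirement on $\lambda$, namely $\liminf_{n \to \infty} \lambda/\tilde\lambda > 0$, is
fulfilled if $\lambda \ge \sigma e_n \sqrt{n \log p_n}$, which completes the proof. □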
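
The choice $\eta = 0.013$ in this proof is pure arithmetic: for nonnegative $a$ and $x$, an inequality $a^2 \le (x + 2\eta a)^2$ gives $a \le x/(1 - 2\eta)$, and the factor $1/(1 - 2\eta)$ has to stay below $18/17.5$. A quick check, given only as a sanity sketch:

```python
eta = 0.013

# a^2 <= (x + 2*eta*a)^2 with a, x >= 0 implies a <= x / (1 - 2*eta).
factor = 1.0 / (1.0 - 2.0 * eta)
target = 18.0 / 17.5

print(factor, target)

# Largest eta for which the factor still does not exceed 18/17.5.
eta_max = (1.0 - 17.5 / 18.0) / 2.0
print(eta_max)
```

Any $\eta$ up to `eta_max` (a little below 0.014) would serve equally well; 0.013 is simply a convenient value inside that range.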

Finally, we use Lemmas 3, 4 and 5 to show the bound on the variance of 
the estimator. 



Lemma 6. Under the conditions of Theorem 1, with probability converging
to 1 for $n \to \infty$,

$\|\beta^\lambda - \hat\beta^\lambda\|_{\ell_2}^2 \le 2\sigma^2\,\frac{s_n \log p_n}{n}\,\frac{e_n^2}{\phi_{\min}^2(e_n^2 s_n)}.$


The proof follows immediately from Lemmas 3 and 4 when inserting the 
bound on the number of active variables obtained in Lemma 5. 



4. Numerical illustration: frequency detection. Instead of extensive
numerical simulations, we would like to illustrate a few aspects of Lasso-type
variable selection if the irrepresentable condition is not fulfilled. We are
not making claims that the Lasso is superior to other methods for
high-dimensional data. We merely want to draw attention to the fact that (a)
the Lasso might not be able to select the correct variables but (b) comes
nevertheless close to the true vector in an $\ell_2$-sense.

An illustrative example is frequency detection. It is of interest in some 
areas of the physical sciences to accurately detect and resolve frequency com- 
ponents; two examples are variable stars [27] and detection of gravitational 
waves [6, 32]. A nonparametric approach is often most suitable for fitting of 
the involved periodic functions [15]. However, we assume here for simplicity
that the observations $Y = (Y_1, \ldots, Y_n)$ at time points $t = (t_1, \ldots, t_n)$ are
of the form

$Y_i = \sum_{\omega \in \Omega} \beta_\omega \sin(2\pi\omega t_i + \phi_\omega) + \varepsilon_i,$

where $\Omega$ contains the set of fundamental frequencies involved, and
$\varepsilon_i$ for $i = 1, \ldots, n$ is independently and identically distributed noise
with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. To simplify the problem even more, we assume that
the phases are known to be zero, $\phi_\omega = 0$ for all $\omega \in \Omega$. Otherwise one
might like to employ the Group Lasso [37], grouping together the sine and
cosine part of identical frequencies.

It is of interest to resolve closely adjacent spectral lines [16] and we will
work in this setting in the following. We choose for the experiment $n = 200$
evenly spaced observation times. There are supposed to be two closely
adjacent frequencies with $\omega_1 = 0.0545$ and $\omega_2 = 0.0555 = \omega_1 + 1/300$, both
entering with $\beta_{\omega_1} = \beta_{\omega_2} = 1$. As we have the information that the phase
is zero for all frequencies, the predictor variables are given by all
sine-functions with frequencies evenly spaced between 1/200 and 1/2, with a
spacing of 1/600 between adjacent frequencies.

In the chosen setting, the irrepresentable condition is violated for the
frequency $\omega_m = (\omega_1 + \omega_2)/2$. Even in the absence of noise, this resonance
frequency is included in the Lasso estimate for all positive penalty
parameters, as can be seen from the results further below. As a consequence
of a violated irrepresentable condition, the largest peak in the periodogram
is in general obtained for the resonance frequency. In Figure 1 we show the
periodogram [28] under a moderate noise level $\sigma = 0.2$. The periodogram
shows the amount of energy in each frequency, and is defined through the
function

$\Delta E(\omega) = \sum_i Y_i^2 - \sum_i \big(Y_i - \hat Y_i^{(\omega)}\big)^2,$

where $\hat Y^{(\omega)}$ is the least squares fit of the observations $Y$, using only
sine and cosine functions with frequency $\omega$ as two predictor variables.
There is clearly a peak at frequency $\omega_m$. As can be seen in the close-up
around $\omega_m$, it is not immediately obvious from the periodogram that there
are two frequencies at $\omega_1$ and $\omega_2$. As said above, the irrepresentable
condition is violated for the resonance frequency and it is of interest to see
which frequencies are picked up by the Lasso estimator.
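
The resonance effect can be reproduced without noise in a few lines. The sketch below is an illustration under stated assumptions, not the authors' simulation code: observation times are taken to be $t_i = 1, \ldots, 200$, the energy at a frequency is taken to be the drop in residual sum of squares achieved by a least squares fit on the sine and cosine at that frequency, and the Nyquist frequency 1/2 is left out of the grid because its sine column vanishes at integer times.

```python
import math

def periodogram_energy(y, t, omega):
    """Energy explained by the least squares fit on sin and cos at omega.

    For a least squares fit, sum(y^2) - RSS equals the inner product of
    the fitted values with y, which is what is returned here.
    """
    s = [math.sin(2.0 * math.pi * omega * ti) for ti in t]
    c = [math.cos(2.0 * math.pi * omega * ti) for ti in t]
    ss = sum(v * v for v in s)
    cc = sum(v * v for v in c)
    sc = sum(a * b for a, b in zip(s, c))
    sy = sum(a * b for a, b in zip(s, y))
    cy = sum(a * b for a, b in zip(c, y))
    det = ss * cc - sc * sc
    a = (sy * cc - cy * sc) / det  # coefficient of the sine term
    b = (cy * ss - sy * sc) / det  # coefficient of the cosine term
    return a * sy + b * cy

n = 200
t = list(range(1, n + 1))            # assumed: unit-spaced observation times
omega_1, omega_2 = 0.0545, 0.0555    # signal frequencies as stated in the text
y = [math.sin(2.0 * math.pi * omega_1 * ti) + math.sin(2.0 * math.pi * omega_2 * ti)
     for ti in t]                    # noise-free signal (sigma = 0)

# Candidate frequencies 1/200, 1/200 + 1/600, ..., stopping short of 1/2.
grid = [(3 + k) / 600.0 for k in range(297)]
energy = [periodogram_energy(y, t, w) for w in grid]
peak = grid[max(range(len(grid)), key=lambda i: energy[i])]
print(peak, (omega_1 + omega_2) / 2.0)
```

Over 200 observations the two components are nearly in phase with a single sinusoid at the midpoint frequency, which is why the grid point at 0.055 collects the most energy even though it carries no signal of its own.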

The results are shown in Figures 2 and 3. Figure 3 highlights that the 
two true frequencies are with high probability picked up by the Lasso. The 
resonance frequency is also selected with high probability, no matter how 
the penalty is chosen. This result could be expected as the irrepresentable 
condition is violated and the estimator can thus not be sign consistent. We 
expect from the theoretical results in this manuscript that the coefficient of 
the falsely selected resonance frequency is very small if the penalty parameter 
is chosen correctly. And it can indeed be seen in Figure 2 that the coefficients 
of the true frequencies are much larger than the coefficient of the resonance 
frequency for an appropriate choice of the penalty parameter.

Fig. 1. The energy $\log \Delta E(\omega)$ for a noise level $\sigma = 0.2$ is shown on the left
for a range of frequencies $\omega$. A close-up of the region around the peak is
shown on the right. The two frequencies $\omega_1$ and $\omega_2$ are marked with solid
vertical lines, while the resonance frequency $(\omega_1 + \omega_2)/2$ is shown with a
broken vertical line.

These results reinforce our conclusion that the Lasso might not be able 
to pick up the correct sparsity pattern, but delivers nevertheless useful ap- 
proximations as falsely selected variables are chosen only with a very small 
coefficient; this behavior is typical and expected from the results of Theorem 
1. Falsely selected coefficients can thus be removed in a second step, either 
by thresholding variables with small coefficients or using other relaxation 
techniques. In any case, it is reassuring to know that all important variables 
are included in the Lasso estimate. 

5. Concluding remarks. It has recently been discovered that the Lasso
cannot recover the correct sparsity pattern in certain circumstances, not
even asymptotically for $p_n$ fixed and $n \to \infty$. This casts some doubt on
whether the Lasso is a good method for identification of sparse models for
both low- and high-dimensional data.

Here we have shown that the Lasso can continue to deliver good
approximations to sparse coefficient vectors $\beta$ in the sense that the
$\ell_2$-difference $\|\beta - \hat\beta^\lambda\|_{\ell_2}$ vanishes for large sample sizes $n$, even
if it fails to discover the correct sparsity pattern. The conditions needed
for a good approximation in the $\ell_2$-sense are weaker than the
irrepresentable condition needed for sign
consistency. We pointed out that the correct sparsity pattern could be recov- 
ered in a two-stage procedure when the true coefficients are not too small. 
The first step consists in a regular Lasso fit. Variables with small absolute 
coefficients are then removed from the model in a second step. 
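
The second stage of this procedure amounts to hard-thresholding the first-stage coefficients. A minimal sketch follows; the coefficient values and the threshold `tau` are hypothetical and chosen only to show the mechanics, since a suitable threshold in practice depends on the rates discussed in the theoretical results.

```python
def hard_threshold(beta_hat, tau):
    """Keep coefficients with absolute value above tau, set the rest to zero."""
    return [b if abs(b) > tau else 0.0 for b in beta_hat]

def support(beta):
    """Indices of nonzero coefficients."""
    return {k for k, b in enumerate(beta) if b != 0.0}

# Hypothetical first-stage Lasso fit: two large coefficients plus
# two falsely selected variables with small coefficients.
lasso_fit = [0.93, 0.04, 0.0, -1.10, -0.02]
tau = 0.2  # assumed threshold, for illustration only

second_stage = hard_threshold(lasso_fit, tau)
print(support(lasso_fit), support(second_stage))
```

The first-stage support contains the two small, falsely selected entries; after thresholding only the two large coefficients remain.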

We derived possible scenarios under which $\ell_2$-consistency in the sense of
(4) can be achieved as a function of the sparsity of the vector $\beta$, the
number of samples and the number of variables. Under the condition that
sparse minimal eigenvalues are not decaying too fast in some sense, the
requirement for $\ell_2$-consistency is (ignoring $\log n$ factors)

$\frac{s_n \log p_n}{n} \to 0$ as $n \to \infty$.

The rate of convergence is actually optimal with an appropriate choice of
the tuning parameter $\lambda$ and under the condition of bounded maximal and
minimal sparse eigenvalues. This rate is, apart from logarithmic factors in
$p_n$ and $n$, identical to what could be achieved if the true sparse model were
known. If $\ell_2$-consistency is achieved, the Lasso is selecting all
"sufficiently large" coefficients, and possibly some other unwanted
variables. "Sufficiently large" means here that the squared size of the
coefficients is decaying slower than the rate $n^{-1} s_n \log p_n$, again ignoring
logarithmic factors in the sample size. The number of variables can thus be
narrowed down considerably with the Lasso in a meaningful way, keeping all
important variables. The size of the reduced subset can be bounded with high
probability by the number of truly important variables times a factor that
depends on the decay of the sparse eigenvalues. This factor is often simply
the squared logarithm of the sample size. Our conditions are similar in
spirit to those in related aforementioned works, but expand the ground to
possibly cover cases with more dependent predictors than UUP. These results
support that the Lasso is a useful model identification method for
high-dimensional data.

Fig. 2. An example where the Lasso is bound to select wrong variables, while
being a good approximation to the true vector in the $\ell_2$-sense. Top row: The
noise level increases from left to right as $\sigma = 0, 0.1, 0.2, 1$. For one run
of the simulation, paths of the estimated coefficients are shown as a
function of the square root $\sqrt{\lambda}$ of the penalty parameter. The actually
present signal frequencies $\omega_1$ and $\omega_2$ are shown as solid lines, the
resonance frequency as a broken line, and all other frequencies are shown as
dotted lines. Bottom row: The shaded areas contain, for 90% of all
simulations, the regularization paths of the signal frequencies (region with
solid borders), resonance frequency (area with broken borders) and all other
frequencies (area with dotted boundaries). The path of the resonance
frequency displays reverse shrinkage as its coefficient gets, in general,
smaller for smaller values of the penalty. As expected from the theoretical
results, if the penalty parameter is chosen correctly, it is possible to
separate the signal and resonance frequencies for sufficiently low noise
levels by just retaining large and neglecting small coefficients. It is also
apparent that the coefficient of the resonance frequency is small for a
correct choice of the penalty parameter but very seldom identically zero.

Fig. 3. The top row shows the $\ell_2$-distance between $\beta$ and $\hat\beta^\lambda$ separately
for the signal frequencies (solid blue line), resonance frequency (broken red
line) and all other frequencies (dotted gray line). It is evident that the
distance is quite small for all three categories simultaneously if the noise
level is sufficiently low (the noise level is again increasing from left to
right as $\sigma = 0, 0.1, 0.2, 1$). The bottom row shows, on the other hand, the
average number of selected variables (with nonzero estimated regression
coefficient) in each of the three categories as a function of the penalty
parameter. It is impossible to choose the correct model, as the resonance
frequency is always selected, no matter how low the noise level and no matter
how the penalty parameter is chosen. This illustrates that sign consistency
does not hold if the irrepresentable condition is violated, even though the
estimate can be close to the true vector $\beta$ in the $\ell_2$-sense.